Abstract. Given a set of data, biclustering aims at finding
simultaneous partitions in biclusters of its samples and of the
features which are used for representing the samples. Consistent
biclusterings make it possible to obtain correct classifications of
the samples from the known classification of the features, and vice
versa, and they are very useful for performing supervised
classifications. The problem of finding consistent biclusterings can
be seen as a feature selection problem, where the features that are
not relevant for classification purposes are removed from the set of
data, while the number of retained features is maximized in order to
preserve information. This feature selection problem can be
formulated as a linear fractional 0-1 optimization problem. We
propose a reformulation of this problem as a bilevel optimization
problem, and we present a heuristic algorithm for an efficient
solution of the reformulated problem. Computational experiments show
that the presented algorithm is able to find better solutions than
those obtained by previously presented heuristic algorithms.

I. Introduction 

Data mining techniques are nowadays widely studied, because of the
growing amount of available data that needs to be analyzed. In
particular, clustering techniques aim at finding suitable partitions
of a set of samples in clusters, where data are grouped by following
different criteria. The focus of this paper is biclustering, where
samples and features in a given set of data are partitioned
simultaneously.

Given a set of samples, each sample in the set can be represented by
a sequence of features, which are supposed to be relevant for the
samples. If a set of data contains $n$ samples which are represented
by $m$ features, then the whole set can be represented by an
$m \times n$ matrix $A$, where the samples are organized column by
column, and the features are organized row by row. A bicluster is a
submatrix of $A$, which can be equivalently defined as a pair of
subsets $(S_r, F_r)$, where $S_r$ is a cluster of samples, and $F_r$
is a cluster of features. A biclustering is then a partition of $A$
in $k$ biclusters:

$$B = \{(S_1, F_1), (S_2, F_2), \dots, (S_k, F_k)\},$$

such that the following conditions are satisfied:

$$\bigcup_{r=1}^{k} S_r = A, \qquad S_\zeta \cap S_\xi = \emptyset, \quad 1 \le \zeta \ne \xi \le k, \tag{1}$$

$$\bigcup_{r=1}^{k} F_r = A, \qquad F_\zeta \cap F_\xi = \emptyset, \quad 1 \le \zeta \ne \xi \le k, \tag{2}$$

where $k < \min(n, m)$ is the number of biclusters [2], [6]. Note that
conditions (1) ensure that $B_S = \{S_1, S_2, \dots, S_k\}$ is a
partition of the samples in disjoint clusters, while conditions (2)
ensure that $B_F = \{F_1, F_2, \dots, F_k\}$ is a partition of the
features in disjoint clusters.

We focus on the problem of finding biclusterings of the set of
samples and of the set of features. When such biclusterings can be
found, not only are clusters of samples obtained (as in standard
clustering), but the features causing the partition of the samples
in these clusters are also identified. This information is valuable
in many real-life applications. In particular, biclustering
techniques are widely applied for analyzing gene expression data,
where samples represent particular conditions (for example, the
presence or absence of a disease), and each sample is represented by
a sequence of gene expressions. In this case, finding out which
features (genes) are related to the samples can help in discovering
information about diseases [?], [9].

The concept of consistent biclustering is very important in this
domain [?]. Let us consider a set of samples, and let us suppose that
a certain classification is assigned to such samples. In other
words, we know a partition in clusters of these samples:
$B_S = \{S_1, S_2, \dots, S_k\}$. A classification for the
corresponding features, i.e. for the features used for representing
these samples, can be obtained from $B_S$ (see Section II for
details). Let us refer to this partition of the features with
$B_F = \{F_1, F_2, \dots, F_k\}$. Then, the procedure can be
inverted, and from the obtained classification $B_F$ of the
features, another classification for the samples can be computed:
$\hat{B}_S = \{\hat{S}_1, \hat{S}_2, \dots, \hat{S}_k\}$. In general,
$B_S$ and $\hat{B}_S$ differ. When they instead coincide, the
biclustering
$B = \{(\hat{S}_1, F_1), (\hat{S}_2, F_2), \dots, (\hat{S}_k, F_k)\}$
is referred to as a consistent biclustering.

Consistent biclusterings can be used for classification purposes.
Let us suppose that a training set is available for a certain
classification problem, i.e. a set of samples whose classification
is known. From the classification of the samples, a classification
of the features can be found, and then a certain biclustering, as
explained above. If this biclustering is consistent, then the
original classification of the samples in the training set can be
reconstructed from the classification of its features. Therefore,
the classification of these features can also be exploited for
finding a classification for other samples, which originally have no
known classification.

Unfortunately, sets of data allowing for consistent biclusterings
are quite rare. There are usually features that are not relevant for
the classification of the samples, and these can easily lead to
misclassifications. Because of experimental errors or noise, such
features could be assigned to one bicluster or another, and this
uncertainty causes errors in the classifications. To avoid this, all
the features that are not relevant must be removed. Therefore, we
are interested in selecting a subset of features for which a
consistent biclustering can be found. Since it is preferable to keep
the loss of information as low as possible, the number of selected
features has to be as large as possible.

The feature selection problem related to consistent biclustering is
NP-hard [8]. It can be formulated as a 0-1 linear fractional
optimization problem, which can be very difficult to solve. In
particular, for large (real-life) sets of data, the corresponding
optimization problem is also large, and there are no examples in the
literature in which deterministic techniques have been employed. In
[12], two heuristic algorithms have been proposed for solving the 0-1
linear fractional optimization problem arising in the context of
feature selection by biclustering.

In this paper, we propose a new heuristic algorithm for solving this
feature selection problem. We reformulate the optimization problem
as a bilevel optimization problem, in which the inner problem is
linear. Therefore, we use a deterministic algorithm for solving the
inner problem, which is nested into a general framework where a
heuristic strategy is employed. Our computational experiments show
that the proposed heuristic algorithm is able to find subsets of
features allowing for consistent biclusterings. The obtained results
are compared to the ones reported in other publications [?], [12]:
in general, the heuristic algorithm that we propose is able to find
consistent biclusterings in which the number of selected features is
larger.

The remainder of the paper is organized as follows. In Section II we
develop the concept of consistent biclustering in more detail, and
we present the corresponding feature selection problem. In Section
III we reformulate this feature selection problem as a bilevel
optimization problem and we introduce a heuristic algorithm for an
efficient solution of the problem. Computational experiments on
real-life sets of data are presented in Section IV, as well as a
comparison to another heuristic algorithm. Conclusions are given in
Section V.

II. Consistent biclustering 

Let $A$ be an $m \times n$ matrix related to a certain set of data,
where samples are organized column by column, and their features are
organized row by row. If a classification of the samples is known,
then the centroid of each cluster can be computed as the mean among
all the members of the cluster. Let $C_S$ be the matrix containing
all these centroids, organized column by column, where its generic
element $c^S_{ir}$ refers to the $i$-th feature of the centroid of
the $r$-th cluster of samples. Analogously, a matrix $C_F$
containing the centroids of the clusters related to a known
classification of the features can be defined. The generic element
$c^F_{jr}$ of the matrix $C_F$ refers to the $j$-th sample related
to the centroid of the $r$-th cluster of features. Finally, the
symbol $a_i$ refers to the $i$-th row of the matrix $A$, i.e. to a
feature, and the symbol $a^j$ refers to the $j$-th column of $A$,
i.e. to a sample. In the following discussion, $k$ represents the
number of biclusters (known a priori), and $r \in \{1, 2, \dots, k\}$
refers to the generic bicluster. The symbols $\hat{r}$ and $\xi$ are
used for referring to biclusters having particular properties.

Let us suppose that a classification for the samples in $A$ is
known. In other words, the following partition in $k$ clusters is
available:

$$B_S = \{S_1, S_2, \dots, S_k\}.$$

Starting from this classification, the matrix $C_S$ of centroids can
be computed. Given a feature $a_i$, we can check the value of
$c^S_{ir}$ for all the clusters. If, for a certain cluster
$S_{\hat{r}}$, the element $c^S_{i\hat{r}}$ is the largest over all
possible $r$, then $S_{\hat{r}}$ is the cluster in which the feature
$a_i$ is most expressed. Therefore, it is reasonable to give this
feature the same classification as the samples in $S_{\hat{r}}$.
Formally, it is imposed that:

$$a_i \in F_{\hat{r}} \iff c^S_{i\hat{r}} > c^S_{i\xi} \quad \forall \xi \in \{1, 2, \dots, k\},\ \xi \ne \hat{r}. \tag{3}$$

Note that a complete classification of all the features can be
obtained by imposing the equivalence (3) for all $a_i$. Let

$$B_F = \{F_1, F_2, \dots, F_k\}$$

be the computed classification of the features. Starting from this
classification, the matrix $C_F$ can be computed. In a similar way,
a classification of the samples can be obtained by imposing the
following equivalence:

$$a^j \in \hat{S}_{\hat{r}} \iff c^F_{j\hat{r}} > c^F_{j\xi} \quad \forall \xi \in \{1, 2, \dots, k\},\ \xi \ne \hat{r}. \tag{4}$$

Let

$$\hat{B}_S = \{\hat{S}_1, \hat{S}_2, \dots, \hat{S}_k\}$$

be the computed classification of the samples. In general, the two
classifications $B_S$ and $\hat{B}_S$ are different from each other.
If they coincide, then the partition in biclusters

$$B = \{(\hat{S}_1, F_1), (\hat{S}_2, F_2), \dots, (\hat{S}_k, F_k)\}$$

is, by definition, a consistent biclustering. As already remarked in
the Introduction, the classification of the features obtained from
consistent biclusterings can be exploited for classifying samples
with an unknown classification [2].
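
As a minimal sketch of the procedure described in this section (it is
not part of the original work), the following Python code derives a
feature classification from a sample classification through
equivalence (3), derives a sample classification back through
equivalence (4), and checks whether the two sample classifications
coincide. The function names, the list-of-rows layout used for $A$,
the 0-based cluster labels and the toy data are assumptions made only
for illustration; a tie between two centroids is treated here as "no
classification", since (3) and (4) require strict inequalities.

```python
def argmax_mean_labels(rows, column_labels, k):
    """For each row, average its entries over each cluster of columns and
    return the index of the strictly largest mean (None when there is a tie).
    Assumes that every cluster in column_labels is non-empty."""
    labels = []
    for row in rows:
        sums = [0.0] * k
        counts = [0] * k
        for value, r in zip(row, column_labels):
            sums[r] += value
            counts[r] += 1
        means = [sums[r] / counts[r] for r in range(k)]
        best = max(range(k), key=means.__getitem__)
        is_unique = sum(1 for v in means if v == means[best]) == 1
        labels.append(best if is_unique else None)
    return labels


def transpose(matrix):
    """Turn a list of rows into a list of columns."""
    return [list(column) for column in zip(*matrix)]


def is_consistent(A, sample_labels, k):
    """A has one row per feature and one column per sample."""
    # Equivalence (3): classify the features from the sample clusters.
    feature_labels = argmax_mean_labels(A, sample_labels, k)
    # Every feature must be classified and every feature cluster non-empty.
    if None in feature_labels or set(feature_labels) != set(range(k)):
        return False
    # Equivalence (4): classify the samples back from the feature clusters.
    recomputed = argmax_mean_labels(transpose(A), feature_labels, k)
    return recomputed == list(sample_labels)


# Toy data: 3 features (rows), 4 samples (columns), k = 2 clusters.
A = [[5, 4, 1, 0],
     [6, 5, 0, 1],
     [0, 1, 7, 6]]
print(is_consistent(A, [0, 0, 1, 1], 2))  # True
```

In this toy example the first two features are most expressed in the
first two samples and the third feature in the last two, so the
recomputed sample classification matches the given one.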

If a consistent biclustering exists for a certain set of data, then
the set is said to be biclustering-admitting. However, sets of data
admitting consistent biclusterings are very rare. Therefore,
features must be removed from the set of data to make it
biclustering-admitting [2]. During this process, it is very
important to remove as few features as possible, in order to
preserve the information in the set of data. In practice, a maximal
subset of good features must be extracted from the initial set. The
problem of finding the maximal consistent biclustering can be seen
as a feature selection problem.

Let $f_{ir}$ be a binary parameter which indicates whether the
generic feature $a_i$ belongs to the generic cluster $F_r$
($f_{ir} = 1$) or not ($f_{ir} = 0$). Let
$x = \{x_1, x_2, \dots, x_m\}$ be a binary vector of variables, where
$x_i$ is 1 if the feature $a_i$ is selected, and it is 0 otherwise.
The problem of finding a consistent biclustering considering the
maximum possible number of features can be formulated as follows:

$$\max_x \left( f(x) = \sum_{i=1}^{m} x_i \right) \tag{5}$$

subject, $\forall \hat{r}, \xi \in \{1, 2, \dots, k\}$, $\hat{r} \ne \xi$, $j \in S_{\hat{r}}$, to:

$$\frac{\sum_{i=1}^{m} a_{ij} f_{i\hat{r}} x_i}{\sum_{i=1}^{m} f_{i\hat{r}} x_i} > \frac{\sum_{i=1}^{m} a_{ij} f_{i\xi} x_i}{\sum_{i=1}^{m} f_{i\xi} x_i}. \tag{6}$$

The generic constraint (6) ensures that each sample $a^j$ in
$S_{\hat{r}}$ is most expressed over the selected features of its
own bicluster $(S_{\hat{r}}, F_{\hat{r}})$. Note that the two
fractions are used for computing the centroids of the clusters of
features, and that the sums (at the numerators and at the
denominators) only consider the selected features (each unselected
feature is automatically discarded because $x_i = 0$). The reader is
referred to [?] for additional details.

In this context, two other optimization problems have also been
introduced [12]. They are extensions of the problem (5)-(6), which
have been proposed in order to overcome some problems related to
data affected by noise. If a partition in clusters for the samples
is available, then we can find a partition in clusters for the
features. Each feature $a_i$ is therefore assigned to the cluster
$F_{\hat{r}}$ if $c^S_{i\hat{r}}$ is the centroid with the largest
value. Let us suppose that the following condition holds for a
certain feature $a_i$:

$$\min_{\xi \ne \hat{r}} \left\{ c^S_{i\hat{r}} - c^S_{i\xi} \right\} < \varepsilon,$$

where $\varepsilon$ is a small positive real number. If this is the
case, small changes (i.e. noise) in the data can lead to different
partitions of the features, because the margin between
$c^S_{i\hat{r}}$ and the other centroids is very small.

In order to overcome this problem, the concepts of
$\alpha$-consistent biclustering and $\beta$-consistent biclustering
have been introduced in [12]. They lead to the formulation of the
following two optimization problems. The problem of finding an
$\alpha$-consistent biclustering with a maximal number of features
is equivalent to solving the optimization problem:

$$\max_x \left( f(x) = \sum_{i=1}^{m} x_i \right) \tag{7}$$

subject, $\forall \hat{r}, \xi \in \{1, 2, \dots, k\}$, $\hat{r} \ne \xi$, $j \in S_{\hat{r}}$, to:

$$\frac{\sum_{i=1}^{m} a_{ij} f_{i\hat{r}} x_i}{\sum_{i=1}^{m} f_{i\hat{r}} x_i} > \alpha_j + \frac{\sum_{i=1}^{m} a_{ij} f_{i\xi} x_i}{\sum_{i=1}^{m} f_{i\xi} x_i}, \tag{8}$$

where each $\alpha_j > 0$. Similarly, the problem of finding a
$\beta$-consistent biclustering with a maximal number of features is
equivalent to solving the optimization problem:

$$\max_x \left( f(x) = \sum_{i=1}^{m} x_i \right) \tag{9}$$

subject, $\forall \hat{r}, \xi \in \{1, 2, \dots, k\}$, $\hat{r} \ne \xi$, $j \in S_{\hat{r}}$, to:

$$\frac{\sum_{i=1}^{m} a_{ij} f_{i\hat{r}} x_i}{\sum_{i=1}^{m} f_{i\hat{r}} x_i} > \beta_j \times \frac{\sum_{i=1}^{m} a_{ij} f_{i\xi} x_i}{\sum_{i=1}^{m} f_{i\xi} x_i}, \tag{10}$$

where each $\beta_j > 1$. All the presented optimization problems are
NP-hard [8]. The reader who is interested in more information on the
formulation of these optimization problems can refer to [12], [14].
For a simple and broader discussion on biclustering, refer to [11].
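
To make the role of the constraints more concrete, the following
Python sketch (an illustration, not the authors' implementation)
checks whether a given 0-1 selection vector $x$ satisfies the
constraints for every sample. For brevity, the three variants are
folded into one test, `own > alpha + beta * other`: `alpha = 0` and
`beta = 1` give constraint (6), `beta = 1` gives (8) and `alpha = 0`
gives (10). This folding, the function names, the single scalar used
for all $\alpha_j$ or $\beta_j$ and the toy data are assumptions made
for illustration.

```python
def cluster_mean(A, j, feature_labels, x, r):
    """Mean of sample j over the selected features of cluster r.
    This is the fraction appearing on each side of constraint (6).
    Returns None when no feature of cluster r is selected."""
    selected = [A[i][j] for i in range(len(A))
                if x[i] == 1 and feature_labels[i] == r]
    return sum(selected) / len(selected) if selected else None


def satisfies_constraints(A, sample_labels, feature_labels, x, k,
                          alpha=0.0, beta=1.0):
    """Check the constraints for all samples j and all clusters xi != r_hat."""
    for j, r_hat in enumerate(sample_labels):
        own = cluster_mean(A, j, feature_labels, x, r_hat)
        for xi in range(k):
            if xi == r_hat:
                continue
            other = cluster_mean(A, j, feature_labels, x, xi)
            # An empty cluster makes the fraction undefined: treat as violated.
            if own is None or other is None:
                return False
            if not own > alpha + beta * other:
                return False
    return True


# Toy data: 3 features (rows), 4 samples (columns), k = 2 clusters.
A = [[5, 4, 1, 0],
     [6, 5, 0, 1],
     [0, 1, 7, 6]]
sample_labels = [0, 0, 1, 1]
feature_labels = [0, 0, 1]
x = [1, 1, 1]  # all features selected
print(satisfies_constraints(A, sample_labels, feature_labels, x, 2))
print(satisfies_constraints(A, sample_labels, feature_labels, x, 2, alpha=4.0))
```

With the toy data, the plain constraints hold, while the larger margin
`alpha = 4.0` is violated by the second sample, which illustrates why
fewer features can be kept when the margin grows.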

The three optimization problems (5)-(6), (7)-(8) and (9)-(10) are
linear fractional 0-1 optimization problems. In [2], a possible
linearization of the problem has been studied. However, the authors
noted that currently available solvers for mixed integer programming
are not able to solve the considered linearization, due to the large
number of variables which are usually involved when dealing with
real-life data. Therefore, they presented a heuristic algorithm for
the solution of these problems, which is based on the solution of a
sequence of linear 0-1 (non-fractional) optimization problems.
Subsequently, in [12], another heuristic algorithm has been
proposed, where a sequence of continuous linear optimization
problems needs to be solved. The heuristic algorithm we propose is
able to provide better solutions than those provided by these two.

III. An improved heuristic 

In the following discussion, only the optimization problem (5)-(6)
will be considered, because similar observations can be made for the
other two problems. The computational experiments reported in
Section IV, however, will be related to all three optimization
problems.

We propose a reformulation of the problem (5)-(6) as a bilevel
optimization problem. To this aim, we substitute the denominators in
the constraints (6) with new variables $y_r$, $r = 1, 2, \dots, k$,
where each $y_r$ is related to the generic bicluster. Then, we can
rewrite the constraints (6) as follows:



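$$\frac{1}{y_{\hat{r}}} \sum_{i=1}^{m} a_{ij} f_{i\hat{r}} x_i > \frac{1}{y_\xi} \sum_{i=1}^{m} a_{ij} f_{i\xi} x_i. \tag{11}$$

The constraints (11) must be satisfied for all
$\hat{r}, \xi \in \{1, 2, \dots, k\}$, $\hat{r} \ne \xi$, and for all
$j \in S_{\hat{r}}$.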

Let us consider a set of values of $y_r$, and also another
proportional set of values $\bar{y}_r = \delta y_r$, with
$\delta > 0$. It is easy to see that, given certain values for the
variables $x_i$, with $i = 1, 2, \dots, m$, the constraints (11) are
satisfied with $y_r$ if and only if they are satisfied with
$\bar{y}_r$, because both sides of each constraint are multiplied by
the same positive factor $1/\delta$. As an example, if $k = 3$ and
there is a consistent biclustering in which 20, 30 and 50 features
are selected in the $k$ biclusters, then the constraints (11) are
also satisfied if 0.20, 0.30 and 0.50, respectively, replace the
actual number of features (in this example, the proportional factor
$\delta$ is 0.01). For this reason, the variables $y_r$ can be used
for representing the proportions among the cardinalities of the
clusters of features. In the previous example, 20% of the selected
features are in the first bicluster, 30% in the second one, and 50%
in the last one. The variables $y_r$ can be bounded in the real
interval $[0, 1]$, and the following constraint can be included in
the optimization problem:



$$\sum_{r=1}^{k} y_r = 1. \tag{12}$$

Algorithm 1 A heuristic algorithm for feature selection.

  let iter = 0;
  let x_i = 1, for all i in {1, 2, ..., m};
  let y_r = (sum_i f_ir) / m, for all r in {1, 2, ..., k};
  let range = starting_range;
  while (g(x, y) > 0 and range < max_range) do
     let iter = iter + 1;
     solve the inner optimization problem (linear & cont.);
     if (g(x, y) > 0) then
        increase range;
        if (g(x, y) has improved) then
           range = starting_range;
        end if
        let r' = random in {1, 2, ..., k};
        choose randomly y_r' in [y_r' - range, y_r' + range];
        let r'' = random in {1, 2, ..., k} such that r' != r'';
        set y_r'' so that sum_r y_r = 1;
     end if
  end while

We introduce the function:

$$c(x, y_{\hat{r}}, y_\xi) = \sum_{j \in S_{\hat{r}}} \left| \frac{1}{y_\xi} \sum_{i=1}^{m} a_{ij} f_{i\xi} x_i - \frac{1}{y_{\hat{r}}} \sum_{i=1}^{m} a_{ij} f_{i\hat{r}} x_i \right|_+ ,$$

where the symbol $|\cdot|_+$ represents the function which returns its
argument if it is positive, and returns 0 otherwise. As a
consequence, the value of this function is positive if and only if
the corresponding constraints (11) are not satisfied. Finally, we
reformulate the optimization problem (5)-(6) as the bilevel
optimization problem:



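$$\min_y \left( g(x, y) = \sum_{\hat{r}=1}^{k} \sum_{\xi \ne \hat{r}} c(x, y_{\hat{r}}, y_\xi) \right) \tag{13}$$

subject to:

$$x = \arg\max_x \left( f(x) = \sum_{i=1}^{m} x_i \ :\ \text{subject to constraint (11)} \right). \tag{14}$$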



The objective function $g$ of the outer problem is the sum of
several terms which correspond to the function
$c(x, y_{\hat{r}}, y_\xi)$ for each $\hat{r}$ and
$\xi \in \{1, 2, \dots, k\}$, with $\xi \ne \hat{r}$. The minimization
of all the terms of $g$ leads to the identification of biclusterings
in which the constraints (11) are all satisfied. If this is the
case, the biclustering found is consistent.
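
The evaluation of the outer objective can be sketched in a few lines
of Python. This is only an illustration under stated assumptions: it
assumes that $c$ accumulates, over the samples $j$ of cluster
$\hat{r}$, the positive part of the violation of constraint (11), and
that every $y_r$ is strictly positive. The function names, the 0-based
labels and the toy data are hypothetical.

```python
def positive_part(value):
    """Return the argument if it is positive, and 0 otherwise."""
    return value if value > 0 else 0.0


def scaled_sum(A, j, feature_labels, x, y, r):
    """(1 / y_r) * sum_i a_ij f_ir x_i, i.e. one side of constraint (11)."""
    total = sum(A[i][j] * x[i] for i in range(len(A))
                if feature_labels[i] == r)
    return total / y[r]


def c_term(A, sample_labels, feature_labels, x, y, r_hat, xi):
    """Violation of (11) for the pair (r_hat, xi), summed over j in S_r_hat."""
    return sum(
        positive_part(scaled_sum(A, j, feature_labels, x, y, xi)
                      - scaled_sum(A, j, feature_labels, x, y, r_hat))
        for j, label in enumerate(sample_labels) if label == r_hat)


def g(A, sample_labels, feature_labels, x, y):
    """Outer objective: sum of c over all ordered pairs r_hat != xi."""
    k = len(y)
    return sum(c_term(A, sample_labels, feature_labels, x, y, r_hat, xi)
               for r_hat in range(k) for xi in range(k) if xi != r_hat)


# Toy data: 3 features (rows), 4 samples (columns), k = 2 clusters.
A = [[5, 4, 1, 0],
     [6, 5, 0, 1],
     [0, 1, 7, 6]]
x = [1, 1, 1]
y = [2 / 3, 1 / 3]  # proportions of features in the two clusters
print(g(A, [0, 0, 1, 1], [0, 0, 1], x, y))
```

With a consistent classification of the toy data the value is zero, and
it stays zero when all $y_r$ are multiplied by the same positive
factor, in line with the proportionality argument made for (11).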

Algorithm 1 is a sketch of our heuristic algorithm for feature
selection by consistent biclustering. At the beginning, the
variables $x_i$ are all set to 1, and the variables $y_r$ are set so
that they represent the distribution of all the $m$ features among
the $k$ clusters. Therefore, if the biclustering is already
consistent, then the function $g$ is 0 with this choice for the
variables, and all the features can be selected. In this case, the
condition in the while loop is not satisfied and the algorithm ends.

At each step of the algorithm, the inner optimization problem is
solved. It is a linear 0-1 optimization problem, and we consider its
continuous relaxation, i.e. we allow the variables $x_i$ to take any
real value in the interval $[0, 1]$. Therefore, after a solution has
been obtained, we substitute the fractional values of $x_i$ with 0
if $x_i < 1/2$, or with 1 if $x_i > 1/2$. Moreover, in the
experiments, the strict inequality of the constraints (11) is
relaxed, so that the domains defined by the constraints are closed
domains. Under these hypotheses, the optimization problem can be
solved by commonly used solvers for mixed integer linear programming
(MILP). In our experiments, we employ the ILOG CPLEX solver
(version 11) [?].

After the solution of the inner problem, the function $g$ is
evaluated. If the obtained values for the variables $x_i$, together
with the used values for the variables $y_r$, correspond to a value
for $g$ equal to 0, then the outer problem is also solved and the
algorithm stops. Otherwise, some parameters and variables are
modified in order to prepare the next iteration of the algorithm.

The heuristic part of this algorithm takes inspiration from the
Variable Neighborhood Search (VNS) [?], [10], which is one of the
most successful meta-heuristic searches for global optimization
[15]. The variables $y_r$ are randomly modified during the
algorithm: at each step, two such variables $y_{r'}$ and $y_{r''}$
are chosen randomly so that $r' \ne r''$. Then, $y_{r'}$ is
perturbed, and its value is chosen randomly in the interval centered
in the previous value of $y_{r'}$ and with length $2 \times range$.
As in VNS, the considered interval is relatively small during the
first iterations, in order to focus the search on neighbors of the
current variable values. Then, the interval is progressively
enlarged. However, it is set back to its starting size when better
solutions are found. By employing this strategy borrowed from VNS,
every time there is a new improvement on the objective function
value, the search is initially focused on neighbors of the current
solution, and then it is extended to the whole search domain. When
the considered interval gets too large (max_range), the search is
stopped, because the probability of finding better solutions is low.
After having chosen a value for $y_{r'}$, a new value for $y_{r''}$
is computed so that the constraint (12) on all the variables $y_r$ is
satisfied. Note that, for values of range large enough, the randomly
computed $y_{r'}$ could be such that

$$y_{r'} + \sum_{r \notin \{r', r''\}} y_r > 1.$$

In this case, there are no possible values for $y_{r''}$ in $[0, 1]$
for which the constraint (12) can be satisfied. In order to overcome
this issue, too large values for range are avoided.

TABLE I

Computational experiments on a set of samples from normal and cancer
tissues. The features are selected by finding an α-consistent or
β-consistent biclustering.

  α        0      1      2      5      10
  f(x)     7450   7448   7444   7413   7261

  β        1      1.01   1.50   2.00   3.00
  f(x)     7450   7450   7107   6267   5365

By its nature, the proposed heuristic algorithm can provide
different solutions if it is executed more than once (with different
seeds for the random number generator). Therefore, the algorithm can
be executed a given number of times, and the best solution obtained
can be retained.
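
The random modification of the variables $y_r$ described in this
section can be sketched as follows. This is a hedged illustration
rather than the authors' AMPL code: the function names, the additive
`step` used to enlarge the range and the explicit random generator
argument are assumptions. In addition, instead of simply avoiding
large ranges, this sketch clips the sampling interval so that the
second variable always remains in $[0, 1]$, which is a swapped-in
safeguard and not the strategy stated in the text.

```python
import random


def next_range(current_range, starting_range, step, improved):
    """Reset the range after an improvement of g, otherwise enlarge it."""
    return starting_range if improved else current_range + step


def perturb_proportions(y, search_range, rng):
    """Perturb one proportion and repair a second one so that sum(y) == 1.
    Assumes len(y) >= 2, sum(y) == 1 and every entry of y in [0, 1]."""
    k = len(y)
    r1 = rng.randrange(k)
    r2 = rng.choice([r for r in range(k) if r != r1])
    # Mass held by the untouched variables; y[r1] + y[r2] must equal 1 - others.
    others = sum(y) - y[r1] - y[r2]
    # Clip the interval [y_r1 - range, y_r1 + range] to keep y_r2 in [0, 1].
    low = max(0.0, y[r1] - search_range)
    high = min(1.0 - others, y[r1] + search_range)
    new_y = list(y)
    new_y[r1] = rng.uniform(low, high)
    new_y[r2] = 1.0 - others - new_y[r1]
    return new_y


rng = random.Random(0)  # fixed seed, only to make the example repeatable
y = [0.2, 0.3, 0.5]
print(perturb_proportions(y, 0.05, rng))
```

Running the whole heuristic with several seeds and keeping the best
result, as suggested above, only requires passing differently seeded
generators to such a routine.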

IV. Computational experiments 

We implemented the presented heuristic algorithm for feature
selection in AMPL [?], from which the ILOG CPLEX 11 solver is
invoked for the solution of the inner optimization problem.
Experiments are carried out on an Intel Core 2 CPU 6400 @ 2.13 GHz
with 4GB RAM, running Linux.

The first set of data that we consider is a set of gene expressions
related to human tissues from healthy and sick (affected by cancer)
patients [?]. This set of data is available on the web site of
Princeton University (see the paper for the web link). It contains
36 samples classified as normal or cancer, and each sample is
specified through 7457 features. We applied our heuristic algorithm
for finding a consistent biclustering for the samples and the
features contained in this set of data.

Table I shows some computational experiments. We found
$\alpha$-consistent biclusterings and $\beta$-consistent
biclusterings, with different values for $\alpha$ or $\beta$. Note
that, even though a different $\alpha_j$ or $\beta_j$ can be
considered for each sample, we use one unique value for $\alpha$ and
$\beta$ in each experiment. In the table, the number of selected
features $f(x)$ is given for each experiment.

When $\alpha = 0$ or $\beta = 1$ (consistent biclustering), after
only 4 iterations (41 seconds of CPU time), our heuristic algorithm
is able to provide the list of selected features, and thus to
identify the (few) features to be removed in order to have a
consistent biclustering. In particular, 7 features out of 7457 need
to be removed (and therefore 7450 features are selected).

TABLE II

Computational experiments on a set of samples from patients
diagnosed with ALL or AML diseases. The features are selected by
finding an α-consistent or β-consistent biclustering.

            this paper        [12]
  α        f(x)   err      f(x)   err
  0        7081   2        7024
  10       7076            7024
  20       7075            7018
  30       7072            7014
  40       7068            7010
  50       7061            6959
  60       7046   1        6989
  70       6954   1        6960

            this paper        [12]
  β        f(x)   err      f(x)   err
  1.00     7081   2        7024
  1.05     7075   2        7017   2
  1.10     7068   2        7010
  1.20     7020   1        6937
  1.50     6590   1        6508
  2.00     5987   1        5905
  3.00     5527   2        5458
  5.00     5238   2        5173   2

The bilevel optimization problem to be solved gets harder in the
case of $\alpha$-consistent and $\beta$-consistent biclustering. As
expected, fewer features are selected when larger $\alpha$ or
$\beta$ values are chosen, because the constraints (11) are more
difficult to satisfy. However, using larger values for $\alpha$ and
$\beta$ makes it possible to identify the features that are actually
important for the classification of the samples. The computational
cost of our heuristic algorithm increases when larger $\alpha$ or
$\beta$ values are used: some of the presented experiments need
several minutes of CPU time.

The second real-life set of data we consider consists of samples
from patients diagnosed with acute lymphoblastic leukemia (ALL) or
acute myeloid leukemia (AML) [?] (to download the set of data,
follow the link given in the reference). This set of data is divided
into a training set, which we use for finding consistent
biclusterings, and a validation set, which can be used for checking
the quality of the classifications performed by using the features
previously selected. The training set contains 38 samples: 27 ALL
samples and 11 AML samples. The validation set contains 34 samples:
20 ALL samples and 14 AML samples. The total number of features in
both sets of data is 7129. Since, in this case, a validation set is
also available, we are able to validate the quality of the obtained
biclusterings for different values of the chosen parameter $\alpha$
or $\beta$.

The results of our experiments are in Table II. The total number of
features that are selected in each experiment is reported, together
with the number err of misclassifications that occur when the
samples of the validation set are classified according to the
classification of the features in the $\alpha$-consistent or
$\beta$-consistent biclusterings. When $\alpha = 0$ or $\beta = 1$,
our heuristic algorithm is able to find a consistent biclustering,
but the selected features are not able to provide a correct
classification for all the samples of the validation set (err = 2).
This is probably due to noise in the data, which have been obtained
from an experimental technique. However, the number err of
misclassifications decreases when $\alpha$ or $\beta$ increases. For
example, for $\alpha > 50$, there is only one misclassification for
the samples of the validation set.
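
The validation step can be illustrated with a short Python sketch. The
text only states that validation samples are classified according to
the classification of the features, so the rule used here (assign a
sample to the cluster whose selected features have the largest mean,
in the spirit of equivalence (4)) is an assumption, as are the
function names and the toy data. The sketch also assumes that every
cluster keeps at least one selected feature, and it resolves ties in
favor of the lowest cluster index.

```python
def classify_sample(sample, feature_labels, x, k):
    """Assign a sample (one value per feature) to the cluster whose
    selected features have the largest mean value."""
    means = []
    for r in range(k):
        values = [sample[i] for i in range(len(sample))
                  if x[i] == 1 and feature_labels[i] == r]
        means.append(sum(values) / len(values))
    return max(range(k), key=means.__getitem__)


def count_misclassifications(samples, true_labels, feature_labels, x, k):
    """Number of validation samples whose predicted cluster is wrong (err)."""
    return sum(1 for sample, label in zip(samples, true_labels)
               if classify_sample(sample, feature_labels, x, k) != label)


# Toy validation set: 3 samples described by 3 features, k = 2 clusters.
feature_labels = [0, 0, 1]
x = [1, 1, 1]
validation = [[5, 6, 0], [0, 1, 6], [1, 2, 9]]
true_labels = [0, 1, 0]
print(count_misclassifications(validation, true_labels, feature_labels, x, 2))
```

In this toy case the third sample is labeled as belonging to the first
cluster but is most expressed on the feature of the second one, so it
is counted as one misclassification.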

In Table II we also compare the obtained results to the ones
reported in [12]. Our heuristic algorithm is able to provide
better-quality solutions in the majority of the cases. In
particular, for given choices of $\alpha$ or $\beta$, our heuristic
algorithm is able to find biclusterings in which the total number of
selected features is larger, with the exception of one experiment
($\alpha = 70$). These biclusterings make it possible to perform
good-quality classifications (err = 1 or 2), while a larger number
of features in the set of data are preserved.

V. Conclusions 

We proposed a reformulation of the linear fractional 0-1
optimization problem for feature selection by consistent
biclustering. Our reformulation transforms the problem into a
bilevel optimization problem, in which the inner problem is linear.
We presented a heuristic algorithm for the solution of the
reformulated problem, where the continuous relaxation of the inner
problem is solved exactly at each iteration of the algorithm.
Computational experiments showed that the proposed algorithm can
solve feature selection problems by finding consistent,
$\alpha$-consistent and $\beta$-consistent biclusterings of a given
set of data. The results also showed that this algorithm is able to
find better solutions than those obtained by previously proposed
heuristic algorithms. Future work will be devoted to suitable
strategies for improving the efficiency of the proposed algorithm.